

    %matplotlib inline
    import matplotlib.pyplot as plt
    import torch.utils.data
    import torch.nn
    from random import randrange
    import os
    os.environ["WDS_VERBOSE_CACHE"] = "1"

The WebDataset Format

WebDataset reads datasets that are stored as tar files, with the simple convention that files that belong together and make up a training sample share the same basename. WebDataset can read files from local disk or from any pipe, which allows it to access files using common cloud object stores. WebDataset can also read concatenated MsgPack and CBOR sources.

The WebDataset representation allows writing purely sequential I/O pipelines for large scale deep learning. This is important for achieving high I/O rates from local storage (3x-10x for local drives compared to random access) and for using object stores and cloud storage for training.

The WebDataset format represents images, movies, audio, etc. in their native file formats, making the creation of WebDataset format data as easy as just creating a tar archive. Because WebDataset aligns data on predictable boundaries, it also works well with block deduplication.
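To make the basename convention concrete, here is a minimal sketch that uses only Python's standard tarfile module, not the webdataset library. The sample keys, payloads, and the helper names write_sample and read_samples are hypothetical and chosen for illustration; the reader only approximates how files sharing a basename are grouped into one sample during a purely sequential read.

```python
import io
import json
import tarfile


def write_sample(tar, key, files):
    """Add one sample to the archive: every file shares the basename `key`."""
    for ext, data in files.items():
        info = tarfile.TarInfo(name=f"{key}.{ext}")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


def read_samples(fileobj):
    """Read the archive sequentially, grouping consecutive files by basename."""
    current_key, sample = None, {}
    # "r|" opens the tar as a forward-only stream (no seeking).
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            key, ext = member.name.split(".", 1)
            data = tar.extractfile(member).read()
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = data
    if sample:
        yield sample


# Build a tiny two-sample archive in memory (hypothetical contents).
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    write_sample(tar, "sample0001", {"txt": b"hello", "json": json.dumps({"label": 0}).encode()})
    write_sample(tar, "sample0002", {"txt": b"world", "json": json.dumps({"label": 1}).encode()})

buffer.seek(0)
samples = list(read_samples(buffer))
print([s["__key__"] for s in samples])
```

Because the reader never seeks, the same loop would work on a pipe or an HTTP stream, which is the property the sequential I/O argument above relies on.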

Standard tools can be used for accessing and processing WebDataset-format files.

    %%bash
    curl -s http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar | tar tf - | sed 10q

    e39871fd9fd74f55.jpg
    e39871fd9fd74f55.json
    f18b91585c4d3f3e.jpg
    f18b91585c4d3f3e.json
    ede6e66b2fb59aab.jpg
    ede6e66b2fb59aab.json
    ed600d57fcee4f94.jpg
    ed600d57fcee4f94.json
    ff47e649b23f446d.jpg
    ff47e649b23f446d.json

Related Projects

- the new torchdata library in PyTorch will add native (built-in) support for WebDataset
- the AIStore server provides high-speed storage, caching, and data transformation for WebDataset data
- WebDataset training can be carried out directly against S3, GCS, and other cloud storage buckets
- NVIDIA's DALI library supports reading WebDataset format data directly
- there is a companion project to read WebDataset data in Julia
- the tarp command line program can be used for quick and easy dataset transformations of WebDataset data

WebDataset

WebDataset makes it easy to write I/O pipelines for large datasets. Datasets can be stored locally or in the cloud.

When your data is stored in the cloud, by default, local copies are downloaded and cached when you open the remote dataset.

    import webdataset as wds

    url = "http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar"
    dataset = wds.WebDataset(url).shuffle(1000).decode("rgb").to_tuple("png", "json")

The resulting datasets are standard PyTorch IterableDataset instances.

    isinstance(dataset, torch.utils.data.IterableDataset)

    True

    for image, json in dataset:
        break
    plt.imshow(image)

Let's add some code to transform the input data.

    def preprocess(sample):
        image, json = sample
        try:
            label = json["annotations"][0]["category_id"]
        except (KeyError, IndexError, TypeError):
            # fall back to label 0 when the annotation is missing or malformed
            label = 0
        return 1-image, label

    dataset = dataset.map(preprocess)

    for image, label in dataset:
        break
    plt.imshow(image)
    print(label)

    1

Note that this uses the fluid interface to WebDataset, a convenient shorthand for a lot of training loops. We'll see later what this expands to.

Expanding Samples

Let's add another processing pipeline stage; this one expands a single large input sample into multiple smaller samples. We shuffle the newly generated sub-samples further to mix up sub-samples from different images in the stream.

This uses the .compose method, which takes a function that maps an iterator over samples into another iterator over samples.

    def get_patches(src):
        for sample in src:
            image, label = sample
            h, w = image.shape[:2]
            for i in range(16):
                # pick the top-left corner so that a 256x256 patch fits inside the image
                y, x = randrange(h-256), randrange(w-256)
                patch = image[y:y+256, x:x+256]
                yield (patch, label)

    dataset = dataset.compose(get_patches).shuffle(10000)

    for image, label in dataset:
        break
    plt.imshow(image)
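The combination of an expanding stage followed by a shuffle can be pictured with plain Python generators. The sketch below is an assumption-laden illustration, not the library's implementation: shuffle_buffer is a hypothetical bounded-buffer shuffle, and expand stands in for an iterator-to-iterator function such as get_patches.

```python
import random


def expand(src, copies):
    """Iterator -> iterator stage: emit several sub-samples per input sample."""
    for sample in src:
        for i in range(copies):
            yield (sample, i)


def shuffle_buffer(src, bufsize, rng=random):
    """Approximately shuffle a stream using a buffer of at most `bufsize` items."""
    buf = []
    for item in src:
        buf.append(item)
        if len(buf) >= bufsize:
            # Swap a random element to the end and emit it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    # Drain whatever is left once the source is exhausted.
    rng.shuffle(buf)
    yield from buf


stream = shuffle_buffer(expand(range(5), copies=3), bufsize=4, rng=random.Random(0))
result = list(stream)
print(len(result))
```

A larger buffer mixes sub-samples from more distinct inputs, which is why the shuffle after the expanding stage uses a bigger size than the one before it.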

DataLoader

WebDataset is just an instance of a standard IterableDataset. It's a single-threaded way of iterating over a dataset.

Since image decompression and data augmentation can be compute intensive, PyTorch usually uses the DataLoader class to parallelize data loading and preprocessing. WebDataset is fully compatible with the standard DataLoader.

    loader = torch.utils.data.DataLoader(dataset, num_workers=4, batch_size=8)
    batch = next(iter(loader))
    batch[0].shape, batch[1].shape

    (torch.Size([8, 256, 256, 3]), torch.Size([8]))

The webdataset library contains a small wrapper that adds a fluid interface to the DataLoader (and is otherwise identical).

This comes in handy if you want to shuffle across dataset instances and allows you to change batch size dynamically.

    loader = wds.WebLoader(dataset, num_workers=4, batch_size=8)
    loader = loader.unbatched().shuffle(1000).batched(12)

    batch = next(iter(loader))
    batch[0].shape, batch[1].shape

    (torch.Size([12, 256, 256, 3]), torch.Size([12]))

It is generally most efficient to avoid batching in the DataLoader altogether; instead, batch in the dataset and then rebatch after the loader.
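The rebatching idea can be sketched without any tensors: batches produced upstream are flattened back into individual samples and regrouped at a different size. The helpers batched and unbatched below are hypothetical plain-Python stand-ins that operate on lists; the library's versions also collate samples into tensors, which this sketch leaves out.

```python
from itertools import islice


def batched(src, batchsize):
    """Group a stream of samples into lists of up to `batchsize` items."""
    it = iter(src)
    while True:
        batch = list(islice(it, batchsize))
        if not batch:
            return
        yield batch


def unbatched(src):
    """Flatten a stream of batches back into a stream of samples."""
    for batch in src:
        yield from batch


# Upstream (e.g. inside each worker): batches of 16.
worker_batches = batched(range(48), 16)
# Downstream (after the loader): regroup into batches of 12.
rebatched = list(batched(unbatched(worker_batches), 12))
print([len(b) for b in rebatched])
```

Transferring whole batches between processes amortizes the per-item overhead, while the unbatch and rebatch step leaves the final batch size free to differ from the one used inside the workers.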

A complete pipeline then looks like this.

    url = "http://storage.googleapis.com/nvdata-publaynet/publaynet-train-{000000..000009}.tar"
    dataset = wds.WebDataset(url).shuffle(1000).decode("rgb").to_tuple("png", "json").map(preprocess)
    dataset = dataset.compose(get_patches)
    dataset = dataset.batched(16)

    loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
    loader = loader.unbatched().shuffle(1000).batched(12)

    batch = next(iter(loader))
    batch[0].shape, batch[1].shape

    (torch.Size([12, 256, 256, 3]), torch.Size([12]))

Pipeline Interface

The wds.WebDataset fluid interface is just a convenient shorthand for writing down pipelines. The underlying pipeline is an instance of the wds.DataPipeline class, and you can construct data pipelines explicitly, similar to the way you use nn.Sequential inside models.
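As a rough mental model (a sketch, not the DataPipeline implementation), an explicit pipeline is a source followed by a sequence of stages, each of which turns one iterator into another. The Pipeline class and the stage helpers below are hypothetical names used only for this illustration.

```python
class Pipeline:
    """Chain a source with iterator -> iterator stages, applied in order."""

    def __init__(self, source, *stages):
        self.source = source
        self.stages = stages

    def __iter__(self):
        stream = iter(self.source)
        for stage in self.stages:
            stream = stage(stream)
        return stream


def map_stage(fn):
    """Build a stage that applies `fn` to every sample."""
    def stage(src):
        for sample in src:
            yield fn(sample)
    return stage


def keep_even(src):
    """A hand-written stage: a plain generator function works as-is."""
    for sample in src:
        if sample % 2 == 0:
            yield sample


pipeline = Pipeline(range(10), keep_even, map_stage(lambda x: x * x))
result = list(pipeline)
print(result)  # [0, 4, 16, 36, 64]
```

With the library's own stages, an explicit pipeline is written as follows.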

    dataset = wds.DataPipeline(
        wds.SimpleShardList(url),
        # at this point we have an iterator over all the shards
        wds.shuffle(100),
        wds.split_by_worker,
        # at this point, we have an iterator over the shards assigned to each worker
        wds.tarfile_to_samples(),
        wds.shuffle(1000),
        wds.decode("torchrgb"),
        # at this point, we have a list of decompressed training samples from each shard in this worker in sequence
        get_patches,  # note that you can put iterator->iterator functions into the pipeline directly
        wds.shuffle(10000),
        wds.to_tuple("big.jpg", "json"),
        wds.batched(16)
    )

    # note: `loader` here is still the WebLoader constructed in the previous section
    batch = next(iter(loader))
    batch[0].shape, batch[1].shape

    (torch.Size([12, 256, 256, 3]), torch.Size([12]))

Multinode Training

Multinode training in PyTorch and other frameworks is complex. It depends on how exactly you distribute training across nodes, whether you want to keep "exact epochs" (each sample from the dataset seen exactly once per epoch), and whether your training framework can deal with unequal numbers of samples per node.

The simplest solution for multinode training is to use a resampling strategy for the shards, generating an infinite stream of samples. You then set the epoch length explicitly with the .with_epoch method.

    dataset = wds.WebDataset(url, resampled=True).shuffle(1000).decode("rgb").to_tuple("png", "json").map(preprocess).with_epoch(10000)
    sample = next(iter(dataset))
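The resampling strategy can be illustrated in plain Python. This is a simplified sketch under stated assumptions: resampled_shards, samples_from, and with_epoch are hypothetical stand-ins, the shard names are made up, and each shard is pretended to hold a fixed number of samples. The library's own behavior across workers and epochs may differ in detail.

```python
import random
from itertools import islice


def resampled_shards(urls, rng):
    """Infinite stream of shard names, drawn with replacement."""
    while True:
        yield rng.choice(urls)


def samples_from(shards, samples_per_shard):
    """Stand-in for reading a shard: emit (shard, index) pairs."""
    for shard in shards:
        for i in range(samples_per_shard):
            yield (shard, i)


def with_epoch(src, nsamples):
    """Cut an infinite stream into a nominal epoch of `nsamples` samples."""
    return islice(src, nsamples)


urls = [f"shard-{i:06d}.tar" for i in range(10)]  # hypothetical shard names
rng = random.Random(0)
epoch = list(with_epoch(samples_from(resampled_shards(urls, rng), 100), 250))
print(len(epoch))  # 250
```

Since every node draws from an endless stream and stops after the same number of samples, all nodes see equally long epochs, at the cost of giving up exact epochs.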

Inside a pipeline, you can do the same thing using the ResampledShards generator. Shuffling and splitting by worker are then not needed.

    dataset = wds.DataPipeline(
        wds.ResampledShards(url),
        # at this point we have an iterator over all the shards
        wds.tarfile_to_samples(),
        wds.shuffle(1000),
        wds.decode("torchrgb"),
        # at this point, we have a list of decompressed training samples from each shard in this worker in sequence
        get_patches,  # note that you can put iterator->iterator functions into the pipeline directly
        wds.shuffle(10000),
        wds.to_tuple("big.jpg", "json"),
        wds.batched(16)
    ).with_epoch(10000)

    # note: `loader` here is still the WebLoader constructed in the DataLoader section
    batch = next(iter(loader))
    batch[0].shape, batch[1].shape

    (torch.Size([12, 256, 256, 3]), torch.Size([12]))

Installation and Documentation

    $ pip install webdataset

For the GitHub version:

$ pip install git+https://github.com/tmbdev/webdataset.git

Here are some videos talking about WebDataset and large scale deep learning:

- Introduction to Large Scale Deep Learning
- Loading Training Data with WebDataset
- Creating Datasets in WebDataset Format
- Tools for Working with Large Datasets

Examples: (NB: some of these are for older versions of WebDataset, but the differences should be small)

- loading videos
- splitting raw videos into clips for training
- converting the Falling Things dataset

Dependencies

The WebDataset library only requires PyTorch, NumPy, and a small library called braceexpand.
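Brace notation is what lets a single URL such as publaynet-train-{000000..000009}.tar name a whole set of shards. The sketch below is a hypothetical stand-in that handles only one zero-padded numeric range; the real braceexpand library supports more general patterns, and its API is not shown here.

```python
import re


def expand_numeric_range(spec):
    """Expand a single {start..end} range, preserving zero padding."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", spec)
    if match is None:
        return [spec]
    start, end = match.group(1), match.group(2)
    width = len(start)
    prefix, suffix = spec[:match.start()], spec[match.end():]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]


shards = expand_numeric_range("publaynet-train-{000000..000009}.tar")
print(shards[0], shards[-1], len(shards))
```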

WebDataset loads a few additional libraries dynamically only when they are actually needed and only in the decoder:

- PIL/Pillow for image decoding
- torchvision, torchvideo, torchaudio for image/video/audio decoding
- msgpack for MessagePack decoding
- the curl command line tool for accessing HTTP servers
- the Google/Amazon/Azure command line tools for accessing cloud storage buckets

Loading of one of these libraries is triggered by configuring a decoder that attempts to decode content in the given format and encountering a file in that format during decoding. (Eventually, the torch... dependencies will be refactored into those libraries.)

Data Decoding

Data decoding is a special kind of transformation of samples. You could simply write a decoding function like this:

    import io
    import imageio

    def my_sample_decoder(sample):
        result = dict(__key__=sample["__key__"])
        for key, value in sample.items():
            if key == "png" or key.endswith(".png"):
                result[key] = imageio.imread(io.BytesIO(value))
            elif ...:
                ...
        return result

    dataset = wds.Processor(wds.map, my_sample_decoder)(dataset)

This gets tedious, though, and it also unnecessarily hardcodes the sample's keys into the processing pipeline. To help with this, the Decoder helper class simplifies this kind of code. Its primary use is decoding compressed image, video, and audio formats, as well as unzipping .gz files.

Here is an example of automatically decoding .png images with imread and using the default torch_video and torch_audio decoders for video and audio:

    def my_png_decoder(key, value):
        if not key.endswith(".png"):
            return None
        assert isinstance(value, bytes)
        return imageio.imread(io.BytesIO(value))

    dataset = wds.Decoder(my_png_decoder, wds.torch_video, wds.torch_audio)(dataset)
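The convention that a decoder function returns None for keys it does not handle suggests a simple chain of handlers. The following is a minimal sketch of that idea using only the standard library; decode_sample, has_extension, and the two handlers are hypothetical names, and the code is not the Decoder class itself.

```python
import json


def has_extension(key, ext):
    """True if the key is the extension itself or ends with '.<ext>'."""
    return key == ext or key.endswith("." + ext)


def decode_text(key, value):
    if not has_extension(key, "txt"):
        return None  # not handled here; let the next handler try
    return value.decode("utf-8")


def decode_json(key, value):
    if not has_extension(key, "json"):
        return None
    return json.loads(value)


def decode_sample(sample, handlers):
    """Apply the first handler that accepts each key; leave other values as bytes."""
    result = {}
    for key, value in sample.items():
        if not key.startswith("__"):  # keep metadata such as __key__ untouched
            for handler in handlers:
                decoded = handler(key, value)
                if decoded is not None:
                    value = decoded
                    break
        result[key] = value
    return result


raw = {"__key__": "sample0001", "txt": b"hello", "json": b'{"label": 3}', "bin": b"\x00\x01"}
decoded = decode_sample(raw, [decode_text, decode_json])
print(decoded["txt"], decoded["json"])
```

Keeping each handler small and keyed on the extension means new formats can be supported by adding a handler rather than editing one large decoding function.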

You can use whatever criteria you like for deciding how to decode values in samples. When used with standard WebDataset format files, the keys are the full extensions of the file names inside a .tar file. For consistency, it's recommended that you primarily rely on the extensions (e.g., .png, .mp4) to decide which decoders to use. There is a special helper function that simplifies this:

    def my_decoder(value):
        return imageio.imread(io.BytesIO(value))

    dataset = wds.Decoder(wds.handle_extension(".png", my_decoder))(dataset)

Alternative Representation: CBOR

An alternative representation of collections of samples is based on the IETF CBOR standard, an efficient, binary representation of data structures. CBOR files are particularly useful for large collections of very small samples (data tuples, short strings, etc.)

    import cbor
    import numpy as np

Writing CBOR files is very easy:

    with open("test.cbors", "wb") as stream:
        for i in np.linspace(-5.0, 5.0, 1000):
            cbor.dump((i, str(i)[-5:]), stream)

Of course, you can read these files directly:

    with open("test.cbors", "rb") as stream:
        for i in range(3):
            print(cbor.load(stream))

    [-5.0, '-5.0']
    [-4.98998998998999, '98999']
    [-4.97997997997998, '97998']

And CBOR files/shards integrate fully into DataPipeline with the cbors_to_samples function.

    dataset = wds.DataPipeline(
        wds.SimpleShardList("test.cbors"),
        wds.cbors_to_samples(),
    )
    data = list(iter(dataset))
    len(data), data[0]

    (1000, [-5.0, '-5.0'])

"Smaller" Datasets and Desktop Computing

WebDataset is an ideal solution for training on petascale datasets kept on high performance distributed data stores like AIStore, AWS/S3, and Google Cloud. Compared to data center GPU servers, desktop machines have much slower network connections, but training jobs on desktop machines often also use much smaller datasets. WebDataset is also very useful for such smaller datasets: you can develop and test on a small dataset and then scale up to a large one simply by using more shards.

Here are different usage scenarios:

- desktop deep learning, smaller datasets
    - copy all shards to local disk manually
    - use automatic shard caching
- prototyping, development, testing of jobs for large scale training
    - copy a small subset of shards to local disk
    - use automatic shard caching with a small subrange of shards
- cloud training against cloud buckets
    - use WebDataset directly with remote URLs
- on premises training with high performance store (e.g., AIStore) and fast networks
    - use WebDataset directly with remote URLs
- on premises training with slower object stores and/or slower networks
    - use automatic shard caching

Location Independence, Caching, Etc.

WebDataset makes it easy to use a single specification for your datasets and run your code without change in different environments.

Loadable Dataset Specifications

If you write your input pipelines such that they are defined by a dataset specification in some language, you can most easily retarget your training pipelines to different datasets. You can do this either by dynamically loading the Python code that constructs the pipeline or by using a YAML/JSON dataset specification.

A YAML dataset specification looks like this:

    dataset:
      - shards: gs://nvdata-ocropus-tess/ia1-{000000..000033}.tar
        scaleprob: 0.3
      - shards: gs://nvdata-ocropus-tess/cdipsub-{000000..000022}.tar
        scale: [1.0, 3.0]
      - shards: gs://nvdata-ocropus-tess/gsub-{000000..000167}.tar
        scale: [0.4, 1.0]
      - shards: gs://nvdata-ocropus-tess/bin-gsub-{000000..000167}.tar
        extensions: nrm.jpg
        scale: [0.3, 1.0]
      - shards: gs://nvdata-ocropus/rendered.tar
        scaleprob: 1.0

Note that datasets can be composed from different shard collections, mixed in different proportions.

The dataset specification reader will be integrated in the next minor version update.

AIStore Proxy

If you want to use an AIStore server as a cache, you can tell any WebDataset pipeline to replace direct accesses to your URLs with proxied accesses via the AIStore server. To do that, you need to set a couple of environment variables.

    export AIS_ENDPOINT=http://nix:51080
    export USE_AIS_FOR="gs"

Now, any accesses to Google Cloud Storage (gs:// urls) will be routed through the AIS server.

URL Rewriting

You can rewrite URLs using regular expressions via an environment variable; the syntax is WDS_REWRITE=regex=regex;regex=regex.

For example, to replace gs:// accesses with local file accesses, use

    export WDS_REWRITE="gs://=/shared/data/"
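To show what such a rewrite specification means, here is a minimal sketch that parses the regex=regex;regex=regex syntax and applies it to a URL. The functions parse_rewrites and rewrite_url are hypothetical, and the matching details (for instance, whether patterns are anchored at the start of the URL) are assumptions; the library's actual handling may differ.

```python
import re


def parse_rewrites(spec):
    """Parse 'pattern=replacement;pattern=replacement' into compiled rules."""
    rules = []
    for rule in spec.split(";"):
        if not rule:
            continue
        # Split on the first '=' only, so replacements may contain '='.
        pattern, _, replacement = rule.partition("=")
        rules.append((re.compile(pattern), replacement))
    return rules


def rewrite_url(url, rules):
    """Apply every rule in order to the URL."""
    for pattern, replacement in rules:
        url = pattern.sub(replacement, url)
    return url


rules = parse_rewrites("gs://=/shared/data/")
local = rewrite_url("gs://nvdata-openimages/openimages-train-000000.tar", rules)
print(local)  # /shared/data/nvdata-openimages/openimages-train-000000.tar
```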

To access Google cloud data via ssh, you might use something like:

    export WDS_REWRITE="gs://=pipe:ssh proxyhost gsutil cat "

Use the Caching Mechanism

If you use the built-in caching mechanism, you can simply download shards to a local directory and specify that directory as the cache directory. The shards in that directory will override the shards that are being downloaded. Shards in the cache are mapped based on the pathname and file name of your shard names.

Direct Copying of Shards

Let's take the OpenImages dataset as an example; it's half a terabyte large. For development and testing, you may not want to download the entire dataset, but you may also not want to use the dataset remotely. With WebDataset, you can just download a small number of shards and use them during development.

    !curl -L -s http://storage.googleapis.com/nvdata-openimages/openimages-train-000000.tar > /tmp/openimages-train-000000.tar

    dataset = wds.DataPipeline(
        wds.SimpleShardList("/tmp/openimages-train-000000.tar"),
        wds.tarfile_to_samples(),
    )
    repr(next(iter(dataset)))[:200]
